Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Authors
Rice University
Abstract
Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for deployment stems from the context window. It is commonly recognized that model weights are memory-hungry; however, the size of the key-value embeddings stored during the generation process (the KV cache) can easily surpass the model size. The enormous size of the KV cache constrains the inference batch size, which is crucial for high-throughput inference workloads. Inspired by an interesting observation of the attention scores, we hypothesize the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations. [Note: only the relatively important tokens significantly affect subsequent generation.] Based on our empirical verification and theoretical analysis around this hypothesis, we propose Scissorhands, a system that maintains the memory usage of the KV cache at a fixed budget without finetuning the model [Note: fixed KV cache budget]. In essence, Scissorhands manages the KV cache by storing the pivotal tokens with a higher probability. We validate that Scissorhands reduces the inference memory usage of the KV cache by up to 5X without compromising model quality. We further demonstrate that Scissorhands can be combined with 4-bit quantization, traditionally used to compress model weights, to achieve up to 20X compression.
One-sentence summary
Retain only the important KV cache entries.
Motivation
Repetitive Attention Pattern
Attention scores were sampled at three different sequence positions; the positions that receive high attention scores are largely the same across steps.
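This repetition can be quantified by measuring how much the top-k attended positions overlap between two decoding steps. A minimal sketch (the function name and the synthetic score vectors are illustrative, not from the paper):

```python
import numpy as np

def topk_overlap(scores_a, scores_b, k):
    """Fraction of the top-k attended positions shared by two decoding steps.

    scores_a, scores_b: 1-D attention-score arrays over the same prefix.
    Returns a value in [0, 1]; values near 1 indicate a repetitive pattern.
    """
    top_a = set(np.argsort(scores_a)[-k:])  # indices of the k largest scores
    top_b = set(np.argsort(scores_b)[-k:])
    return len(top_a & top_b) / k

# Toy example: both steps concentrate attention on positions 0 and 2.
step1 = np.array([0.50, 0.10, 0.30, 0.05, 0.05])
step2 = np.array([0.40, 0.05, 0.40, 0.10, 0.05])
print(topk_overlap(step1, step2, k=2))  # 1.0
```

Averaging this overlap over many step pairs is one way to verify the pattern the figure illustrates.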
Persistence of Importance Hypothesis
Only tokens that had a substantial influence at a previous step will play an important role in the next generation step.
Equivalently: for a newly generated token, the tokens that receive high scores in its attention computation should also have received high scores in the attention computations of earlier generated tokens.
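Under this hypothesis, a fixed-budget KV cache can be kept by evicting tokens with low accumulated attention scores. The paper describes a probabilistic scheme (pivotal tokens are kept with higher probability); the sketch below uses a simpler deterministic greedy variant, and all names (`evict_to_budget`, `keep_recent`) are illustrative assumptions:

```python
import numpy as np

def evict_to_budget(keys, vals, importance, budget, keep_recent=4):
    """Shrink a KV cache to `budget` entries.

    importance[i]: accumulated attention score token i has received so far.
    The most recent `keep_recent` tokens are always kept; the remaining
    slots go to the older tokens with the highest accumulated scores
    (the "pivotal" tokens under the persistence-of-importance hypothesis).
    """
    n = len(importance)
    if n <= budget:
        return keys, vals, importance
    recent = list(range(n - keep_recent, n))
    # Older tokens ranked by accumulated score, descending.
    older = np.argsort(importance[: n - keep_recent])[::-1][: budget - keep_recent]
    keep = sorted(older.tolist() + recent)
    return keys[keep], vals[keep], importance[keep]

# Toy cache of 8 tokens; tokens 0, 2, 4 are "pivotal", 6 and 7 are recent.
keys = np.arange(8).reshape(8, 1)   # stand-in for 8 key vectors
vals = np.arange(8).reshape(8, 1)
imp = np.array([5.0, 0.1, 3.0, 0.2, 4.0, 0.3, 0.1, 0.2])
k2, v2, i2 = evict_to_budget(keys, vals, imp, budget=5, keep_recent=2)
print(k2.flatten().tolist())  # [0, 2, 4, 6, 7]
```

In a real decoder, `importance` would be updated from the attention weights at every step, and eviction would run whenever the cache exceeds the budget, keeping memory constant regardless of sequence length.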